Towards Data-Agnostic Pruning At Initialization: What Makes a Good Sparse Mask?

Neural Information Processing Systems

Although PaI methods manage to find trainable subnetworks that outperform random pruning, their performance in terms of both accuracy and computational reduction falls far short of post-training pruning, and a clear understanding of why PaI works is still missing.
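As a rough illustration of the sparse masks discussed here (not code from the paper), a magnitude-based mask at initialization can be compared against a random mask at the same sparsity; `magnitude_mask` and `random_mask` are hypothetical helpers written for this sketch:

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Keep the largest-|w| entries, zero the rest (a common PaI-style baseline)."""
    k = int(weights.size * (1.0 - sparsity))          # number of weights to keep
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    return (np.abs(weights) >= threshold).astype(np.float32)

def random_mask(weights, sparsity, seed=0):
    """Random mask at the same sparsity, the baseline PaI aims to beat."""
    rng = np.random.default_rng(seed)
    return (rng.random(weights.shape) >= sparsity).astype(np.float32)

w = np.random.default_rng(1).normal(size=(64, 64))
m = magnitude_mask(w, sparsity=0.9)
print(m.mean())  # fraction of weights kept, ~0.1
```

Training then proceeds with the masked weights `w * m`; the open question the paper raises is which mask properties make such subnetworks trainable.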




Appendix for You Only Condense Once: Two Rules for Pruning Condensed Datasets

Neural Information Processing Systems

The augmentations include: Color, which adjusts the brightness, saturation, and contrast of images; and Flip, which flips images horizontally with probability 0.5. Each iteration consists of two parts: the first updates the synthetic images, and the second updates the network.
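A minimal sketch of the Flip augmentation described above, assuming H×W×C NumPy images (illustrative only, not the appendix's implementation):

```python
import numpy as np

def random_horizontal_flip(img, p=0.5, rng=None):
    """Flip an HxWxC image left-right with probability p."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p:
        return img[:, ::-1, :]   # reverse the width axis
    return img

img = np.arange(6).reshape(2, 3, 1)
flipped = random_horizontal_flip(img, p=1.0)   # force the flip for illustration
print(flipped[:, :, 0])
```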


Appendix A and Generalization

Neural Information Processing Systems

The directional derivative of the loss function is closely related to the eigenspectrum of the mNTKs. For deep models, as noted by Hoffer et al. (2017), the weight distance from initialization grows during training; combining Lemma 2 and Eq. 18, we find that as training iterations increase, the model's Rademacher complexity also grows as its weights deviate further from their initializations. We generally follow the settings of Liu et al. (2019) to train BERT; all VGG baselines are initialized with Kaiming initialization (He et al., 2015) and trained with SGD. Network pruning (Frankle & Carbin, 2018; Sanh et al., 2020; Liu et al., 2021) applies various criteria to remove parameters, and MAT is the first work to employ the principal eigenvalue of the mNTK as a module selection criterion. Table 5 compares the extended MAT, the vanilla BERT model, and SNIP (Lee et al., 2018b). In our implementation, we apply SNIP in a modular manner by calculating the connection sensitivity of each module; in contrast, using the MAT criterion, we prune 50% of the attention heads and train the remaining ones with MAT, which further accelerates computation by 56.7%. Following Turc et al. (2019), we apply the proposed MAT to BERT models of different network scales.
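As an illustrative sketch (not the authors' code), SNIP-style connection sensitivity |w · ∂L/∂w| can be aggregated per module to rank and prune attention heads, roughly as the modular SNIP variant above describes; all names and the toy gradients here are hypothetical:

```python
import numpy as np

def snip_scores(weights, grads):
    """SNIP connection sensitivity: |w * dL/dw|, normalized to sum to 1."""
    s = np.abs(weights * grads)
    return s / s.sum()

def module_sensitivity(modules):
    """Modular variant: total sensitivity per named module (hypothetical helper)."""
    return {name: float(np.abs(w * g).sum()) for name, (w, g) in modules.items()}

rng = np.random.default_rng(0)
modules = {f"head_{i}": (rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
           for i in range(4)}
scores = module_sensitivity(modules)
keep = sorted(scores, key=scores.get, reverse=True)[:2]   # prune 50% of the heads
print(keep)
```

MAT would instead rank modules by the principal eigenvalue of each module's NTK; the selection logic (score, sort, keep top-k) is the same shape.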


itself from prior works on Bayesian sparse neural networks by imposing a spike-and-slab prior with the Dirac spike

Neural Information Processing Systems

We thank the reviewers for their positive comments and constructive suggestions; the paper will be updated accordingly in the camera-ready version. Hence, all posterior samples automatically come from exactly sparse DNN models. Note that more experiments will be added in the final version. NIPS 2017) to serve the purpose of faster prediction.


DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

Neural Information Processing Systems

Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and extensive computational costs to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to utilize the similarity of features across adjacent denoising stages to reduce computational costs through simple and static strategies.
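A toy sketch of the "simple and static strategy" mentioned above: cache an expensive block's output and reuse it across adjacent denoising steps, exploiting feature similarity between neighboring steps. Everything here (the update rule, `cache_every`, the stand-in block) is illustrative, not DiP-GO's actual method:

```python
import numpy as np

def denoise_with_caching(x, num_steps, block, cache_every=2):
    """Recompute `block` only every `cache_every` steps; reuse the cached
    feature in between, cutting the number of expensive evaluations."""
    cached = None
    evals = 0
    for t in range(num_steps):
        if cached is None or t % cache_every == 0:
            cached = block(x, t)
            evals += 1
        x = x - 0.1 * cached          # toy update using the (possibly cached) feature
    return x, evals

block = lambda x, t: x                # stand-in for an expensive UNet block
x, evals = denoise_with_caching(np.ones(4), num_steps=10, block=block, cache_every=2)
print(evals)  # 5 block evaluations instead of 10
```

DiP-GO's point is that such static schedules are suboptimal; it instead learns which steps to skip via few-step gradient optimization.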